AVA-AVD: Audio-Visual Speaker Diarization in the Wild
Audio-visual speaker diarization aims at detecting "who spoke when" using
both auditory and visual signals. Existing audio-visual diarization datasets
are mainly focused on indoor environments like meeting rooms or news studios,
which are quite different from in-the-wild videos in many scenarios such as
movies, documentaries, and audience sitcoms. To develop diarization methods for
these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD)
dataset. Our experiments demonstrate that adding AVA-AVD to the training set
produces significantly better diarization models for in-the-wild videos, even
though the dataset is relatively small. Moreover, this benchmark is challenging due
to the diverse scenes, complicated acoustic conditions, and completely
off-screen speakers. As a first step towards addressing the challenges, we
design the Audio-Visual Relation Network (AVR-Net) which introduces a simple
yet effective modality mask to capture discriminative information based on face
visibility. Experiments show that our method not only outperforms
state-of-the-art methods but is also more robust as the ratio of off-screen
speakers varies. Our data and code have been made publicly available at
https://github.com/showlab/AVA-AVD.
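As a rough illustration of the modality-mask idea described above, the sketch below (hypothetical PyTorch, not the released AVR-Net implementation; all module and variable names are assumptions) zeroes out the visual branch whenever no face is visible, so the fused embedding falls back to an audio-only representation:

import torch
import torch.nn as nn

class MaskedAudioVisualFusion(nn.Module):
    """Fuse audio and face embeddings into one speaker embedding,
    masking the visual branch when the speaker's face is off screen."""

    def __init__(self, audio_dim=256, face_dim=256, out_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, out_dim)
        self.face_proj = nn.Linear(face_dim, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, audio_emb, face_emb, face_visible):
        # face_visible: (batch,) float tensor, 1.0 if a face track exists, else 0.0
        a = self.audio_proj(audio_emb)
        # Zero out the visual branch for off-screen speakers so the fused
        # embedding degrades gracefully to audio-only information.
        v = self.face_proj(face_emb) * face_visible.unsqueeze(-1)
        return self.fuse(torch.cat([a, v], dim=-1))

# Pairwise same-speaker scoring via cosine similarity of fused embeddings.
fusion = MaskedAudioVisualFusion()
emb = fusion(torch.randn(4, 256), torch.randn(4, 256),
             torch.tensor([1.0, 0.0, 1.0, 0.0]))
same_speaker_score = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)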
PV3D: A 3D Generative Model for Portrait Video Generation
Recent advances in generative adversarial networks (GANs) have demonstrated
the capabilities of generating stunning photo-realistic portrait images. While
some prior works have applied such image GANs to unconditional 2D portrait
video generation and static 3D portrait synthesis, there are few works
successfully extending GANs for generating 3D-aware portrait videos. In this
work, we propose PV3D, the first generative framework that can synthesize
multi-view consistent portrait videos. Specifically, our method extends the
recent static 3D-aware image GAN to the video domain by generalizing the 3D
implicit neural representation to model the spatio-temporal space. To introduce
motion dynamics to the generation process, we develop a motion generator by
stacking multiple motion layers to generate motion features via modulated
convolution. To alleviate motion ambiguities caused by camera/human motions, we
propose a simple yet effective camera condition strategy for PV3D, enabling
both temporal and multi-view consistent video generation. Moreover, PV3D
introduces two discriminators for regularizing the spatial and temporal domains
to ensure the plausibility of the generated portrait videos. Together, these
designs enable PV3D to generate 3D-aware, motion-plausible portrait videos with
high-quality appearance and geometry, significantly outperforming prior works.
As a result, PV3D is able to support many downstream applications such as
animating static portraits and view-consistent video motion editing. Code and
models will be released at https://showlab.github.io/pv3d
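The description of the motion generator above can be made concrete with a much-simplified sketch (hypothetical PyTorch, not the released PV3D code; demodulation and the 3D-aware generator are omitted, and every name here is an illustrative assumption): each motion layer scales its input features with a per-sample motion code before a temporal convolution, and several such layers are stacked to produce motion features.

import torch
import torch.nn as nn

class MotionLayer(nn.Module):
    """One motion layer: a 1D convolution over time whose input is
    modulated per-sample by a latent motion code (demodulation omitted)."""

    def __init__(self, channels=64, motion_dim=128):
        super().__init__()
        self.to_scale = nn.Linear(motion_dim, channels)  # motion code -> per-channel scale
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, feat, motion_code):
        # feat: (batch, channels, time); motion_code: (batch, motion_dim)
        scale = self.to_scale(motion_code).unsqueeze(-1)  # (batch, channels, 1)
        return self.act(self.conv(feat * scale))          # modulate, then convolve

# Stack several motion layers to turn a latent motion code into temporal features.
layers = nn.ModuleList([MotionLayer() for _ in range(4)])
feat = torch.randn(2, 64, 16)   # e.g. features for 16 frames
z_motion = torch.randn(2, 128)  # latent motion code
for layer in layers:
    feat = layer(feat, z_motion)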
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that
reconstructs neural radiance fields for a dynamic human-object-scene from a
single monocular in-the-wild video. Our method enables pausing the video at any
frame and rendering all scene details (dynamic humans, objects, and
backgrounds) from arbitrary viewpoints. The first challenge in this task is the
complex object motions in human-object interactions, which we tackle by
introducing new object bones into the conventional human skeleton hierarchy
to effectively estimate large object deformations in our dynamic human-object
model. The second challenge is that humans interact with different objects at
different times, for which we introduce two new learnable object state
embeddings that can be used as conditions for learning our human-object
representation and scene representation, respectively. Extensive experiments
show that HOSNeRF outperforms state-of-the-art approaches on two challenging
datasets by a large margin of 40%-50% in terms of LPIPS. The code, data, and
compelling examples of 360° free-viewpoint renderings from single videos
will be released at https://showlab.github.io/HOSNeRF.
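As a loose illustration of the learnable object state embeddings, the sketch below (hypothetical PyTorch, not the HOSNeRF release; names, dimensions, and the tiny MLP are assumptions) conditions a NeRF-style MLP on an embedding chosen by the object the human is interacting with at the current frame:

import torch
import torch.nn as nn

class StateConditionedRadianceField(nn.Module):
    """Tiny NeRF-style MLP whose input points are concatenated with a
    learnable embedding of the current human-object interaction state."""

    def __init__(self, num_states=4, state_dim=16, pos_dim=3, hidden=128):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, state_dim)  # one vector per object state
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                              # RGB + density
        )

    def forward(self, points, state_id):
        # points: (num_rays, num_samples, 3); state_id: index of the object
        # the human interacts with at this frame.
        e = self.state_emb(state_id).expand(*points.shape[:-1], -1)
        return self.mlp(torch.cat([points, e], dim=-1))

field = StateConditionedRadianceField()
pts = torch.rand(1024, 64, 3)              # sampled 3D points along rays
rgb_sigma = field(pts, torch.tensor(1))    # query with the frame's object state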